Numerical Linear Algebra

Biostat/Biomath M257

Author

Dr. Hua Zhou @ UCLA

Published

April 17, 2025

System information (for reproducibility):

versioninfo()
Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = 8
  JULIA_EDITOR = code

Load packages:

using Pkg

Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
  Activating project at `~/Documents/github.com/ucla-biostat-257/2025spring/slides/08-numalgintro`
Status `~/Documents/github.com/ucla-biostat-257/2025spring/slides/08-numalgintro/Project.toml`
  [6e4b80f9] BenchmarkTools v1.6.0
  [0e44f5e4] Hwloc v3.3.0
  [bdcacae8] LoopVectorization v0.12.172
  [6f49c342] RCall v0.14.6
  [37e2e46d] LinearAlgebra v1.11.0
  [9a3f8284] Random v1.11.0

1 Introduction

  • Topics in numerical algebra:

    • BLAS
    • solve linear equations \(\mathbf{A} \mathbf{x} = \mathbf{b}\)
    • regression computations \(\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}\)
    • eigen-problems \(\mathbf{A} \mathbf{x} = \lambda \mathbf{x}\)
    • generalized eigen-problems \(\mathbf{A} \mathbf{x} = \lambda \mathbf{B} \mathbf{x}\)
    • singular value decompositions \(\mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T\)
    • iterative methods for numerical linear algebra
  • Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the building blocks of most statistical computing tasks (optimization, MCMC).

  • Our major goals (or learning objectives) are to

    1. know the complexity (flop count) of each task
    2. be familiar with the BLAS and LAPACK functions (what they do)
    3. avoid re-inventing the wheel: do not implement these dense linear algebra subroutines yourself
    4. understand the need for iterative methods
    5. apply appropriate numerical algebra tools to various statistical problems
  • All high-level languages (Julia, Matlab, Python, R) call BLAS and LAPACK for numerical linear algebra.

    • Julia offers more flexibility by exposing interfaces to many BLAS/LAPACK subroutines directly. See documentation.
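
For instance, here is a minimal sketch (not from the original notes) that queries the loaded BLAS backend and calls a wrapped BLAS routine directly:

using LinearAlgebra

# which BLAS library did Julia load? (OpenBLAS by default)
BLAS.get_config()
# call the wrapped BLAS routine nrm2 (dnrm2 for Float64) directly
x = randn(5)
BLAS.nrm2(x)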

2 BLAS

  • BLAS stands for Basic Linear Algebra Subprograms.

  • See netlib for a complete list of standardized BLAS functions.

  • There are many implementations of BLAS.

    • Netlib provides a reference implementation.
    • Matlab uses Intel’s MKL (Math Kernel Library). The MKL implementation is the gold standard on the market. It is not open source, but the compiled library is free for Linux and macOS. However, not surprisingly, it only works on Intel CPUs.
    • Julia uses OpenBLAS. OpenBLAS is the best cross-platform, open source implementation. With the MKL.jl package, it’s also very easy to use MKL in Julia.
  • There are 3 levels of BLAS functions.

| Level | Example Operation | Name | Dimension | Flops |
|---|---|---|---|---|
| 1 | \(\alpha \gets \mathbf{x}^T \mathbf{y}\) | dot product | \(\mathbf{x}, \mathbf{y} \in \mathbb{R}^n\) | \(2n\) |
| 1 | \(\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}\) | axpy | \(\alpha \in \mathbb{R}\), \(\mathbf{x}, \mathbf{y} \in \mathbb{R}^n\) | \(2n\) |
| 2 | \(\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}\) | gaxpy | \(\mathbf{A} \in \mathbb{R}^{m \times n}\), \(\mathbf{x} \in \mathbb{R}^n\), \(\mathbf{y} \in \mathbb{R}^m\) | \(2mn\) |
| 2 | \(\mathbf{A} \gets \mathbf{A} + \mathbf{y} \mathbf{x}^T\) | rank-one update | \(\mathbf{A} \in \mathbb{R}^{m \times n}\), \(\mathbf{x} \in \mathbb{R}^n\), \(\mathbf{y} \in \mathbb{R}^m\) | \(2mn\) |
| 3 | \(\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}\) | matrix multiplication | \(\mathbf{A} \in \mathbb{R}^{m \times p}\), \(\mathbf{B} \in \mathbb{R}^{p \times n}\), \(\mathbf{C} \in \mathbb{R}^{m \times n}\) | \(2mnp\) |
  • Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z).
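
For example, here is a hedged sketch using Julia's LinearAlgebra.BLAS wrappers: the same dot call dispatches to the S, D, or Z routine according to the element type.

using LinearAlgebra

x, y = randn(5), randn(5)
BLAS.dot(x, y)                      # ddot: double precision dot product
BLAS.dot(Float32.(x), Float32.(y))  # sdot: single precision dot product
z = randn(ComplexF64, 5)
BLAS.dotc(z, z)                     # zdotc: double complex (conjugated)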

3 Examples

The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

Some operations appear to be level-3 but are in fact level-2.

Example 1. A common operation in statistics is column scaling or row scaling \[ \begin{eqnarray*} \mathbf{A} &=& \mathbf{A} \mathbf{D} \quad \text{(column scaling)} \\ \mathbf{A} &=& \mathbf{D} \mathbf{A} \quad \text{(row scaling)}, \end{eqnarray*} \] where \(\mathbf{D}\) is diagonal. For example, in generalized linear models (GLMs), the Fisher information matrix takes the form
\[ \mathbf{X}^T \mathbf{W} \mathbf{X}, \] where \(\mathbf{W}\) is a diagonal matrix with observation weights on diagonal.

Column and row scalings are essentially level-2 operations!
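
As a hedged illustration (the design matrix X and weights w here are hypothetical, not part of the original example), storing the weights in a Diagonal matrix computes \(\mathbf{X}^T \mathbf{W} \mathbf{X}\) via a level-2 scaling plus one level-3 multiplication, without ever forming a dense \(\mathbf{W}\):

using LinearAlgebra

X = randn(100, 5)           # hypothetical design matrix
w = rand(100)               # hypothetical observation weights
W = Diagonal(w)             # O(n) storage instead of O(n^2)
FIM = transpose(X) * W * X  # level-2 scaling, then a single level-3 gemm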

using BenchmarkTools, LinearAlgebra, Random

Random.seed!(257) # seed
n = 2000
A = rand(n, n) # n-by-n matrix
d = rand(n)  # n vector
D = Diagonal(d) # diagonal matrix with d as diagonal
2000×2000 Diagonal{Float64, Vector{Float64}}:
 0.0416032   ⋅         ⋅         ⋅       …   ⋅         ⋅         ⋅ 
  ⋅         0.887679   ⋅         ⋅           ⋅         ⋅         ⋅ 
  ⋅          ⋅        0.102233   ⋅           ⋅         ⋅         ⋅ 
  ⋅          ⋅         ⋅        0.08407      ⋅         ⋅         ⋅ 
  ⋅          ⋅         ⋅         ⋅           ⋅         ⋅         ⋅ 
  ⋅          ⋅         ⋅         ⋅       …   ⋅         ⋅         ⋅ 
  ⋅          ⋅         ⋅         ⋅           ⋅         ⋅         ⋅ 
  ⋅          ⋅         ⋅         ⋅           ⋅         ⋅         ⋅ 
  ⋅          ⋅         ⋅         ⋅           ⋅         ⋅         ⋅ 
  ⋅          ⋅         ⋅         ⋅           ⋅         ⋅         ⋅ 
 ⋮                                       ⋱                      
  ⋅          ⋅         ⋅         ⋅           ⋅         ⋅         ⋅ 
  ⋅          ⋅         ⋅         ⋅           ⋅         ⋅         ⋅ 
  ⋅          ⋅         ⋅         ⋅           ⋅         ⋅         ⋅ 
  ⋅          ⋅         ⋅         ⋅           ⋅         ⋅         ⋅ 
  ⋅          ⋅         ⋅         ⋅       …   ⋅         ⋅         ⋅ 
  ⋅          ⋅         ⋅         ⋅           ⋅         ⋅         ⋅ 
  ⋅          ⋅         ⋅         ⋅          0.213471   ⋅         ⋅ 
  ⋅          ⋅         ⋅         ⋅           ⋅        0.870533   ⋅ 
  ⋅          ⋅         ⋅         ⋅           ⋅         ⋅        0.318876
Dfull = diagm(d) # convert to full matrix
2000×2000 Matrix{Float64}:
 0.0416032  0.0       0.0       0.0      …  0.0       0.0       0.0
 0.0        0.887679  0.0       0.0         0.0       0.0       0.0
 0.0        0.0       0.102233  0.0         0.0       0.0       0.0
 0.0        0.0       0.0       0.08407     0.0       0.0       0.0
 0.0        0.0       0.0       0.0         0.0       0.0       0.0
 0.0        0.0       0.0       0.0      …  0.0       0.0       0.0
 0.0        0.0       0.0       0.0         0.0       0.0       0.0
 0.0        0.0       0.0       0.0         0.0       0.0       0.0
 0.0        0.0       0.0       0.0         0.0       0.0       0.0
 0.0        0.0       0.0       0.0         0.0       0.0       0.0
 ⋮                                       ⋱                      
 0.0        0.0       0.0       0.0         0.0       0.0       0.0
 0.0        0.0       0.0       0.0         0.0       0.0       0.0
 0.0        0.0       0.0       0.0         0.0       0.0       0.0
 0.0        0.0       0.0       0.0         0.0       0.0       0.0
 0.0        0.0       0.0       0.0      …  0.0       0.0       0.0
 0.0        0.0       0.0       0.0         0.0       0.0       0.0
 0.0        0.0       0.0       0.0         0.213471  0.0       0.0
 0.0        0.0       0.0       0.0         0.0       0.870533  0.0
 0.0        0.0       0.0       0.0         0.0       0.0       0.318876
# this is calling BLAS routine for matrix multiplication: O(n^3) flops
# this is SLOW!
@benchmark $A * $Dfull
BenchmarkTools.Trial: 97 samples with 1 evaluation per sample.
 Range (minmax):  47.582 ms97.681 ms   GC (min … max): 0.00% … 0.74%
 Time  (median):     50.322 ms               GC (median):    1.34%
 Time  (mean ± σ):   51.877 ms ±  5.577 ms   GC (mean ± σ):  1.29% ± 0.57%
      ▃█▃ ▂ ▂ ▁                                               
  ▅▁▁▅███▇██▇█▁▁▁▅▁▁▅▅▅▁▅▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  47.6 ms      Histogram: log(frequency) by time      73.5 ms <
 Memory estimate: 30.53 MiB, allocs estimate: 3.
# dispatch to special method for diagonal matrix multiplication.
# columnwise scaling: O(n^2) flops
@benchmark $A * $D
BenchmarkTools.Trial: 1673 samples with 1 evaluation per sample.
 Range (minmax):  991.917 μs  4.351 ms   GC (min … max):  0.00% … 20.49%
 Time  (median):       2.979 ms                GC (median):    16.46%
 Time  (mean ± σ):     2.984 ms ± 382.462 μs   GC (mean ± σ):  15.75% ±  7.71%
                                      ▄▅█▇▇▅▇▂▃▁  ▁             
  ▂▂▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▄▄▄▅▅▆▅▅▄▄▅▇████████████▇█▆▅▅▄▄▃▄▃▃▃▂ ▄
  992 μs           Histogram: frequency by time         3.96 ms <
 Memory estimate: 30.53 MiB, allocs estimate: 3.
# Or we can use broadcasting (with recycling)
@benchmark $A .* transpose($d)
BenchmarkTools.Trial: 1713 samples with 1 evaluation per sample.
 Range (minmax):  980.583 μs  4.282 ms   GC (min … max):  0.00% … 30.85%
 Time  (median):       2.932 ms                GC (median):    16.91%
 Time  (mean ± σ):     2.916 ms ± 346.419 μs   GC (mean ± σ):  15.96% ±  8.30%
                                         ▃▂▆█▅▃▂ ▁             
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▄▅▇▄▆▆▅▄▄▃▅▆█████████▇▇▆▆▅▄▄▃▃▃▃ ▄
  981 μs           Histogram: frequency by time         3.72 ms <
 Memory estimate: 30.53 MiB, allocs estimate: 3.
# in-place: avoid allocate space for result
# rmul!: compute matrix-matrix product A*B, overwriting A, and return the result.
@benchmark rmul!($A, $D)
BenchmarkTools.Trial: 8694 samples with 1 evaluation per sample.
 Range (minmax):  508.667 μs758.541 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     558.208 μs                GC (median):    0.00%
 Time  (mean ± σ):   570.563 μs ±  28.364 μs   GC (mean ± σ):  0.00% ± 0.00%
               ▂▅▆██▅▄▄▄▃▃▃▃▂▂▂▂▂▁▁▂▁▂▂▁▁▁▂▁▁▁▁               ▂
  ▄▂▆▅▄▄▅▆▅▄▄▄▅██████████████████████████████████▇█▇▇▇▇▆▇▇▆▅▇ █
  509 μs        Histogram: log(frequency) by time        676 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.
# In-place broadcasting 
@benchmark $A .= $A .* transpose($d)
BenchmarkTools.Trial: 3125 samples with 1 evaluation per sample.
 Range (minmax):  1.462 ms  5.809 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.562 ms                GC (median):    0.00%
 Time  (mean ± σ):   1.598 ms ± 222.062 μs   GC (mean ± σ):  0.00% ± 0.00%
             ▁▄▇▇█▄▄▃▃▂▂▂▂▁▂▂▂▂▂▁▁▁                         ▁
  ▄▃▄▃▁▄▃▁▃▃▇████████████████████████▇▆▇▅▆▅▆▄▆▅▃▁▅▃▃▄▄▃▄▃▅▃ █
  1.46 ms      Histogram: log(frequency) by time      1.84 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.

Exercise: Try @turbo (SIMD) and @tturbo (multi-threaded SIMD) from LoopVectorization.jl package.

Note: In R or Matlab, diag(d) will create a full matrix. Be cautious when using the diag function: do we really need a full diagonal matrix?

using RCall

R"""
d <- runif(5)
diag(d)
"""
RObject{RealSxp}
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 0.6874543 0.0000000 0.0000000 0.0000000 0.0000000
[2,] 0.0000000 0.8877676 0.0000000 0.0000000 0.0000000
[3,] 0.0000000 0.0000000 0.2987494 0.0000000 0.0000000
[4,] 0.0000000 0.0000000 0.0000000 0.9595963 0.0000000
[5,] 0.0000000 0.0000000 0.0000000 0.0000000 0.2842544

This works only when Matlab is installed.

#| eval: false
using MATLAB

mat"""
d = rand(5, 1)
diag(d)
"""

Example 2. The inner product between two matrices \(\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}\) is often written as \[ \text{trace}(\mathbf{A}^T \mathbf{B}), \text{trace}(\mathbf{B} \mathbf{A}^T), \text{trace}(\mathbf{A} \mathbf{B}^T), \text{ or } \text{trace}(\mathbf{B}^T \mathbf{A}). \] These appear to be level-3 operations (matrix multiplications with \(O(m^2n)\) or \(O(mn^2)\) flops).

Random.seed!(123)
n = 2000
A, B = randn(n, n), randn(n, n)

# slow way to evaluate tr(A'B): 2mn^2 flops
@benchmark tr(transpose($A) * $B)
BenchmarkTools.Trial: 95 samples with 1 evaluation per sample.
 Range (minmax):  47.436 ms79.546 ms   GC (min … max): 0.00% … 1.05%
 Time  (median):     52.123 ms               GC (median):    1.60%
 Time  (mean ± σ):   52.616 ms ±  3.310 ms   GC (mean ± σ):  1.51% ± 0.94%
                 ▅▇██▂                                      
  ▃▁▃▁▁▁▁▁▃▁▁▅▆▃▆█████▃▆▃▃▃▃▅▃▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  47.4 ms         Histogram: frequency by time        62.1 ms <
 Memory estimate: 30.53 MiB, allocs estimate: 3.

But \(\text{trace}(\mathbf{A}^T \mathbf{B}) = \langle \text{vec}(\mathbf{A}), \text{vec}(\mathbf{B}) \rangle\). The latter is a level-1 BLAS operation with \(O(mn)\) flops.

# smarter way to evaluate tr(A'B): 2mn flops
@benchmark dot($A, $B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  414.166 μs 14.183 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     434.375 μs                GC (median):    0.00%
 Time  (mean ± σ):   454.625 μs ± 242.236 μs   GC (mean ± σ):  0.00% ± 0.00%
   ▁▇█▇▃▁                                                       
  ▂██████▆▅▄▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  414 μs           Histogram: frequency by time          644 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.

Example 3. Similarly \(\text{diag}(\mathbf{A}^T \mathbf{B})\) can be calculated in \(O(mn)\) flops.

# slow way to evaluate diag(A'B): O(n^3)
@benchmark diag(transpose($A) * $B)
BenchmarkTools.Trial: 95 samples with 1 evaluation per sample.
 Range (minmax):  47.640 ms71.521 ms   GC (min … max): 0.00% … 1.28%
 Time  (median):     51.493 ms               GC (median):    1.68%
 Time  (mean ± σ):   52.748 ms ±  3.421 ms   GC (mean ± σ):  1.52% ± 0.69%
             ▄█▃                                               
  ▃▃▁▁▁▁▁▄▄▃▄███▆█▃▄▄▅▁▃▃▃▃▃▁▄▃▁▁▃▃▄▁▁▃▃▁▁▁▁▁▁▁▃▁▁▄▁▁▁▁▁▁▁▃ ▁
  47.6 ms         Histogram: frequency by time        63.9 ms <
 Memory estimate: 30.55 MiB, allocs estimate: 6.
# smarter way to evaluate diag(A'B): O(n^2)
@benchmark Diagonal(vec(sum($A .* $B, dims = 1)))
BenchmarkTools.Trial: 1195 samples with 1 evaluation per sample.
 Range (minmax):  1.607 ms  6.003 ms   GC (min … max):  0.00% … 37.05%
 Time  (median):     4.195 ms                GC (median):    17.03%
 Time  (mean ± σ):   4.177 ms ± 499.270 μs   GC (mean ± σ):  15.16% ±  8.50%
                                        ▂▄█▃▃▃▃▃▃▂            
  ▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▅▅▆▆▇▆▇▅▅▄▅▆▇███████████▆▅▆▅▅▄▃▃▂ ▄
  1.61 ms         Histogram: frequency by time        5.24 ms <
 Memory estimate: 30.55 MiB, allocs estimate: 7.

To avoid allocating intermediate arrays altogether, we can simply write a double loop or use the dot function.

function diag_matmul!(d, A, B)
    m, n = size(A)
    @assert size(B) == (m, n) "A and B should have same size"
    fill!(d, 0)
    for j in 1:n, i in 1:m
        d[j] += A[i, j] * B[i, j]
    end
    Diagonal(d)
end

d = zeros(eltype(A), size(A, 2))
@benchmark diag_matmul!($d, $A, $B)
BenchmarkTools.Trial: 1482 samples with 1 evaluation per sample.
 Range (minmax):  3.339 ms 3.507 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     3.350 ms               GC (median):    0.00%
 Time  (mean ± σ):   3.370 ms ± 38.298 μs   GC (mean ± σ):  0.00% ± 0.00%
  █▇ ▂                                                        
  ██▆██▅▃▃▃▂▃▂▁▁▂▂▁▁▂▂▂▂▂▁▂▂▂▁▁▂▂▂▂▂▁▂▂▁▂▂▂▃▁▂▂▂▂▂▂▂▂▁▁▁▁▁ ▂
  3.34 ms        Histogram: frequency by time        3.47 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.

Exercise: Try @turbo (SIMD) and @tturbo (multi-threaded SIMD) from LoopVectorization.jl package.
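
A possible starting point for this exercise (an untested sketch; the name diag_matmul_turbo! is ours, and we assume LoopVectorization handles this reduction pattern):

using BenchmarkTools, LinearAlgebra, LoopVectorization

function diag_matmul_turbo!(d, A, B)
    fill!(d, 0)
    # SIMD-vectorized accumulation of columnwise dot products
    @turbo for j in axes(A, 2), i in axes(A, 1)
        d[j] += A[i, j] * B[i, j]
    end
    Diagonal(d)
end

@benchmark diag_matmul_turbo!($d, $A, $B)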

4 Memory hierarchy and level-3 fraction

The key to high performance is effective use of the memory hierarchy. This is true on all architectures.

  • Flop count is not the sole determinant of algorithm efficiency. Another important factor is data movement through the memory hierarchy.

Source: https://cs.brown.edu/courses/csci1310/2020/assign/labs/lab4.html

  • In Julia, we can query the CPU topology with the Hwloc.jl package. For example, this laptop runs an Apple M2 Max chip with 4 efficiency cores and 8 performance cores.
using Hwloc

topology_graphical()
/------------------------------------------------------------------------------------------------------------------------------------------------------------------\
| Machine (2213MB total)                                                                                                                                           |
|                                                                                                                                                                  |
| /----------------------------------------------------------------------------------------------------------------------------------------\  /------------------\ |
| | Package L#0                                                                                                                            |  | CoProc opencl0d0 | |
| |                                                                                                                                        |  |                  | |
| | /------------------------------------------------------------------------------------------------------------------------------------\ |  | 38 compute units | |
| | | NUMANode L#0 P#0 (2213MB)                                                                                                          | |  |                  | |
| | \------------------------------------------------------------------------------------------------------------------------------------/ |  | 72 GB            | |
| |                                                                                                                                        |  \------------------/ |
| | /----------------------------------------------------------------\  /----------------------------------------------------------------\ |                       |
| | | L2 (4096KB)                                                    |  | L2 (16MB)                                                      | |                       |
| | \----------------------------------------------------------------/  \----------------------------------------------------------------/ |                       |
| |                                                                                                                                        |                       |
| | /-------------\  /-------------\  /-------------\  /-------------\  /-------------\  /-------------\  /-------------\  /-------------\ |                       |
| | | L1d (64KB)  |  | L1d (64KB)  |  | L1d (64KB)  |  | L1d (64KB)  |  | L1d (128KB) |  | L1d (128KB) |  | L1d (128KB) |  | L1d (128KB) | |                       |
| | \-------------/  \-------------/  \-------------/  \-------------/  \-------------/  \-------------/  \-------------/  \-------------/ |                       |
| |                                                                                                                                        |                       |
| | /-------------\  /-------------\  /-------------\  /-------------\  /-------------\  /-------------\  /-------------\  /-------------\ |                       |
| | | L1i (128KB) |  | L1i (128KB) |  | L1i (128KB) |  | L1i (128KB) |  | L1i (192KB) |  | L1i (192KB) |  | L1i (192KB) |  | L1i (192KB) | |                       |
| | \-------------/  \-------------/  \-------------/  \-------------/  \-------------/  \-------------/  \-------------/  \-------------/ |                       |
| |                                                                                                                                        |                       |
| | /-------------\  /-------------\  /-------------\  /-------------\  /-------------\  /-------------\  /-------------\  /-------------\ |                       |
| | | Core L#0    |  | Core L#1    |  | Core L#2    |  | Core L#3    |  | Core L#4    |  | Core L#5    |  | Core L#6    |  | Core L#7    | |                       |
| | |             |  |             |  |             |  |             |  |             |  |             |  |             |  |             | |                       |
| | | /---------\ |  | /---------\ |  | /---------\ |  | /---------\ |  | /---------\ |  | /---------\ |  | /---------\ |  | /---------\ | |                       |
| | | | PU L#0  | |  | | PU L#1  | |  | | PU L#2  | |  | | PU L#3  | |  | | PU L#4  | |  | | PU L#5  | |  | | PU L#6  | |  | | PU L#7  | | |                       |
| | | |         | |  | |         | |  | |         | |  | |         | |  | |         | |  | |         | |  | |         | |  | |         | | |                       |
| | | |   P#0   | |  | |   P#1   | |  | |   P#2   | |  | |   P#3   | |  | |   P#4   | |  | |   P#5   | |  | |   P#6   | |  | |   P#7   | | |                       |
| | | \---------/ |  | \---------/ |  | \---------/ |  | \---------/ |  | \---------/ |  | \---------/ |  | \---------/ |  | \---------/ | |                       |
| | \-------------/  \-------------/  \-------------/  \-------------/  \-------------/  \-------------/  \-------------/  \-------------/ |                       |
| |                                                                                                                                        |                       |
| | /----------------------------------------------------------------\                                                                     |                       |
| | | L2 (16MB)                                                      |                                                                     |                       |
| | \----------------------------------------------------------------/                                                                     |                       |
| |                                                                                                                                        |                       |
| | /-------------\  /-------------\  /-------------\  /-------------\                                                                     |                       |
| | | L1d (128KB) |  | L1d (128KB) |  | L1d (128KB) |  | L1d (128KB) |                                                                     |                       |
| | \-------------/  \-------------/  \-------------/  \-------------/                                                                     |                       |
| |                                                                                                                                        |                       |
| | /-------------\  /-------------\  /-------------\  /-------------\                                                                     |                       |
| | | L1i (192KB) |  | L1i (192KB) |  | L1i (192KB) |  | L1i (192KB) |                                                                     |                       |
| | \-------------/  \-------------/  \-------------/  \-------------/                                                                     |                       |
| |                                                                                                                                        |                       |
| | /-------------\  /-------------\  /-------------\  /-------------\                                                                     |                       |
| | | Core L#8    |  | Core L#9    |  | Core L#10   |  | Core L#11   |                                                                     |                       |
| | |             |  |             |  |             |  |             |                                                                     |                       |
| | | /---------\ |  | /---------\ |  | /---------\ |  | /---------\ |                                                                     |                       |
| | | | PU L#8  | |  | | PU L#9  | |  | | PU L#10 | |  | | PU L#11 | |                                                                     |                       |
| | | |         | |  | |         | |  | |         | |  | |         | |                                                                     |                       |
| | | |   P#8   | |  | |   P#9   | |  | |  P#10   | |  | |  P#11   | |                                                                     |                       |
| | | \---------/ |  | \---------/ |  | \---------/ |  | \---------/ |                                                                     |                       |
| | \-------------/  \-------------/  \-------------/  \-------------/                                                                     |                       |
| \----------------------------------------------------------------------------------------------------------------------------------------/                       |
\------------------------------------------------------------------------------------------------------------------------------------------------------------------/
  • For example, the Xeon X5650 CPU has a theoretical throughput of 128 DP GFLOPS but a maximum memory bandwidth of 32 GB/s.

  • Can we keep CPU cores busy with enough deliveries of matrix data and ship the results to memory fast enough to avoid backlog?
    Answer: use high-level BLAS as much as possible.

| BLAS | Dimension | Mem. Refs. | Flops | Ratio |
|---|---|---|---|---|
| Level 1: \(\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}\) | \(\mathbf{x}, \mathbf{y} \in \mathbb{R}^n\) | \(3n\) | \(2n\) | 3:2 |
| Level 2: \(\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}\) | \(\mathbf{x}, \mathbf{y} \in \mathbb{R}^n\), \(\mathbf{A} \in \mathbb{R}^{n \times n}\) | \(n^2\) | \(2n^2\) | 1:2 |
| Level 3: \(\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}\) | \(\mathbf{A}, \mathbf{B}, \mathbf{C} \in \mathbb{R}^{n \times n}\) | \(4n^2\) | \(2n^3\) | 2:n |
  • Higher-level BLAS (levels 3 and 2) makes more effective use of the arithmetic logic units (ALUs) by keeping them busy: the surface-to-volume effect.

Source: Jack Dongarra’s slides.
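
To make the surface-to-volume effect concrete, here is a hedged sketch (not from the original notes; mul_by_gemv! is our name) that spends the same \(2n^3\) flops as \(n\) level-2 gemv calls versus a single level-3 gemm call; the single gemm is typically much faster because it moves far less data per flop:

using BenchmarkTools, LinearAlgebra

n = 2000
A, B, C = randn(n, n), randn(n, n), zeros(n, n)

# level 2: one gemv per column of B — same flop count, much more memory traffic
function mul_by_gemv!(C, A, B)
    for j in axes(B, 2)
        mul!(view(C, :, j), A, view(B, :, j))
    end
    C
end

@btime mul_by_gemv!($C, $A, $B);
@btime mul!($C, $A, $B);  # level 3: one gemm call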

  • A distinction between LAPACK and LINPACK (older versions of R use LINPACK) is that LAPACK makes use of higher-level BLAS as much as possible (usually by smart partitioning) to increase the so-called level-3 fraction.

  • To appreciate the effort that goes into an optimized BLAS implementation such as OpenBLAS (evolved from GotoBLAS), see the Quora question, especially the video. The bottom line is:

Get familiar with (good implementations of) BLAS/LAPACK and use them as much as possible.

5 Effect of data layout

  • Data layout in memory affects algorithmic efficiency too. It is much faster to move contiguous chunks of data in memory than to retrieve or write scattered data.

  • Storage mode: column-major (Fortran, Matlab, R, Julia) vs row-major (C/C++).

  • A cache line is the minimum amount of data that can be transferred between cache and memory.

    • x86 CPUs: 64 bytes
    • ARM CPUs: 32 bytes

  • In Julia, we can query the cache line size with Hwloc.jl.
# Apple Silicon (M1/M2 chips) don't have L3 cache
Hwloc.cachelinesize()
ErrorException: Your system doesn't seem to have an L3 cache.

Stacktrace:
 [1] cachelinesize()
   @ Hwloc ~/.julia/packages/Hwloc/IvkQ5/src/highlevel_api.jl:392
 [2] top-level scope
   @ ~/Documents/github.com/ucla-biostat-257/2025spring/slides/08-numalgintro/jl_notebook_cell_df34fa98e69747e1a8f8a730347b8e2f_X43sZmlsZQ==.jl:2
  • Accessing column-major stored matrix by rows (\(ij\) looping) causes lots of cache misses.
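
A quick way to feel this (a minimal sketch, not part of the original notes; M is a throwaway matrix): sum the same matrix with the column-friendly ji loop order versus the row-wise ij order.

using BenchmarkTools

M = rand(2000, 2000)

# inner loop walks down a column: unit stride, cache friendly
function sum_ji(M)
    s = zero(eltype(M))
    @inbounds for j in axes(M, 2), i in axes(M, 1)
        s += M[i, j]
    end
    s
end

# inner loop walks across a row: stride 2000, many cache misses
function sum_ij(M)
    s = zero(eltype(M))
    @inbounds for i in axes(M, 1), j in axes(M, 2)
        s += M[i, j]
    end
    s
end

@btime sum_ji($M);
@btime sum_ij($M);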

  • Take matrix multiplication as an example \[ \mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}, \quad \mathbf{A} \in \mathbb{R}^{m \times p}, \mathbf{B} \in \mathbb{R}^{p \times n}, \mathbf{C} \in \mathbb{R}^{m \times n}. \] Assume the storage is column-major, such as in Julia. There are 6 variants of the algorithms according to the order in the triple loops.

    • jki or kji looping:
# inner most loop
for i in 1:m
    C[i, j] = C[i, j] + A[i, k] * B[k, j]
end
    • ikj or kij looping:
# inner most loop        
for j in 1:n
    C[i, j] = C[i, j] + A[i, k] * B[k, j]
end
  • ijk or jik looping:
# inner most loop        
for k in 1:p
    C[i, j] = C[i, j] + A[i, k] * B[k, j]
end
  • We pay attention to the innermost loop, where the vector calculation occurs. The associated stride when accessing the three matrices in memory (assuming column-major storage) is
| Variant | A Stride | B Stride | C Stride |
|---|---|---|---|
| \(jki\) or \(kji\) | Unit | 0 | Unit |
| \(ikj\) or \(kij\) | 0 | Non-Unit | Non-Unit |
| \(ijk\) or \(jik\) | Non-Unit | Unit | 0 |

Apparently the variants \(jki\) or \(kji\) are preferred.

"""
    matmul_by_loop!(A, B, C, order)

Overwrite `C` by `A * B`. `order` indicates the looping order for triple loop.
"""
function matmul_by_loop!(A::Matrix, B::Matrix, C::Matrix, order::String)
    
    m = size(A, 1)
    p = size(A, 2)
    n = size(B, 2)
    fill!(C, 0)
    
    if order == "jki"
        @inbounds for j = 1:n, k = 1:p, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kji"
        @inbounds for k = 1:p, j = 1:n, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "ikj"
        @inbounds for i = 1:m, k = 1:p, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kij"
        @inbounds for k = 1:p, i = 1:m, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "ijk"
        @inbounds for i = 1:m, j = 1:n, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "jik"
        @inbounds for j = 1:n, i = 1:m, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
end

using Random

Random.seed!(123)
m, p, n = 2000, 100, 2000
A = rand(m, p)
B = rand(p, n)
C = zeros(m, n);
  • \(jki\) and \(kji\) looping:
using BenchmarkTools

@benchmark matmul_by_loop!($A, $B, $C, "jki")
BenchmarkTools.Trial: 86 samples with 1 evaluation per sample.
 Range (minmax):  57.407 ms82.670 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     57.860 ms               GC (median):    0.00%
 Time  (mean ± σ):   58.422 ms ±  2.735 ms   GC (mean ± σ):  0.00% ± 0.00%
    ▂  ▇█                                                     
  ▆▇█▅▇██▃▆▅▇▆▆▇▅▁▁▁▁▃▁▃▅▁▃▁▁▁▁▁▃▁▁▁▁▃▁▃▁▃▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  57.4 ms         Histogram: frequency by time        61.1 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark matmul_by_loop!($A, $B, $C, "kji")
BenchmarkTools.Trial: 26 samples with 1 evaluation per sample.
 Range (minmax):  187.500 ms225.509 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     197.176 ms                GC (median):    0.00%
 Time  (mean ± σ):   198.485 ms ±   7.839 ms   GC (mean ± σ):  0.00% ± 0.00%
      ▃ ▃     ▃     ▃  █ ▃                                      
  ▇▁▁▁█▇█▁▇▁▇▁█▇▁▁▁█▁▁█▇█▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  188 ms           Histogram: frequency by time          226 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.
  • \(ikj\) and \(kij\) looping:
@benchmark matmul_by_loop!($A, $B, $C, "ikj")
BenchmarkTools.Trial: 10 samples with 1 evaluation per sample.
 Range (minmax):  513.649 ms517.241 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     514.513 ms                GC (median):    0.00%
 Time  (mean ± σ):   514.746 ms ±   1.078 ms   GC (mean ± σ):  0.00% ± 0.00%
  █    ▁  ▁            ▁      █                              ▁  
  █▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  514 ms           Histogram: frequency by time          517 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark matmul_by_loop!($A, $B, $C, "kij")
BenchmarkTools.Trial: 10 samples with 1 evaluation per sample.
 Range (minmax):  511.840 ms522.688 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     517.032 ms                GC (median):    0.00%
 Time  (mean ± σ):   516.743 ms ±   3.278 ms   GC (mean ± σ):  0.00% ± 0.00%
  █       █  █        █       █   █ █           █            █  
  █▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  512 ms           Histogram: frequency by time          523 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.
  • \(ijk\) and \(jik\) looping:
@benchmark matmul_by_loop!($A, $B, $C, "ijk")
BenchmarkTools.Trial: 21 samples with 1 evaluation per sample.
 Range (minmax):  237.517 ms247.732 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     242.527 ms                GC (median):    0.00%
 Time  (mean ± σ):   242.632 ms ±   3.128 ms   GC (mean ± σ):  0.00% ± 0.00%
  ▁      █ ▁  ▁▁ ▁     ▁   ▁ ▁   ▁  ▁ █    ▁       ▁ ▁ ▁▁   ▁  
  █▁▁▁▁▁▁█▁█▁▁██▁█▁▁▁▁▁█▁▁▁█▁██▁▁█▁▁█▁█▁▁▁▁█▁▁▁▁▁▁▁█▁█▁██▁▁▁█ ▁
  238 ms           Histogram: frequency by time          248 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark matmul_by_loop!($A, $B, $C, "jik")
BenchmarkTools.Trial: 21 samples with 1 evaluation per sample.
 Range (minmax):  237.851 ms249.396 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     241.580 ms                GC (median):    0.00%
 Time  (mean ± σ):   242.233 ms ±   3.356 ms   GC (mean ± σ):  0.00% ± 0.00%
      ▃                                                      
  ▇▁▁▁█▇▇▁▁▁▇▁▁▇▁▇▁▁▇█▁▁▇▁▁▁▁▁▁▁▁▇▁▇▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▇▇▁▁▁▁▁▇ ▁
  238 ms           Histogram: frequency by time          249 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.
  • Question: Can our loop beat BLAS? Julia wraps a BLAS library for matrix multiplication. We see that the BLAS library wins hands down (multi-threading, the Strassen algorithm, a higher level-3 fraction via blocked outer products).
@benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 1742 samples with 1 evaluation per sample.
 Range (minmax):  2.549 ms 24.514 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.721 ms                GC (median):    0.00%
 Time  (mean ± σ):   2.867 ms ± 675.298 μs   GC (mean ± σ):  0.00% ± 0.00%
  ▅▅▇▇█▇▅▆▅▄▃▃▂▂▂▂▁▁▁   ▁▁   ▁                              ▁
  ██████████████████████████▇██▇▇▇▇▇▆▇▇▅▇▅▅▅▅▅▄▄▄▅▅▄▅▁▁▅▅▄▅ █
  2.55 ms      Histogram: log(frequency) by time      4.23 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.
# direct call of BLAS wrapper function
@benchmark LinearAlgebra.BLAS.gemm!('N', 'N', 1.0, $A, $B, 0.0, $C)
BenchmarkTools.Trial: 1748 samples with 1 evaluation per sample.
 Range (minmax):  2.546 ms  9.545 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.705 ms                GC (median):    0.00%
 Time  (mean ± σ):   2.858 ms ± 530.714 μs   GC (mean ± σ):  0.00% ± 0.00%
  ▃▇█▆▅▅▄▂▂▁▁▁                                              ▁
  ████████████▇██▇▇█▅▆▆▇▆▆▅▆▆▅▇▇▅▅▄▅▅▄▅▁▄▁▄▁▅▅▁▄▁▁▅▄▄▄▁▄▁▄▄ █
  2.55 ms      Histogram: log(frequency) by time      5.33 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.

Question (again): Can our loop beat BLAS?

Exercise: Annotate the loop in matmul_by_loop! by @turbo and @tturbo (multi-threading) and benchmark again.
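
A possible sketch for this exercise (untested; matmul_turbo! is our name, and the loop body follows the gemm pattern from the LoopVectorization documentation):

using BenchmarkTools, LoopVectorization

function matmul_turbo!(C, A, B)
    @turbo for j in axes(B, 2), i in axes(A, 1)
        Cij = zero(eltype(C))
        for k in axes(A, 2)
            Cij += A[i, k] * B[k, j]
        end
        C[i, j] = Cij
    end
    C
end

@benchmark matmul_turbo!($C, $A, $B)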

6 BLAS in R

  • Tip for R users: the standard R distribution from CRAN uses a very outdated BLAS/LAPACK library.
using RCall

R"""
sessionInfo()
"""
RObject{VecSxp}
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] C

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.4.2
R"""
library(dplyr)
library(bench)
A <- $A
B <- $B
bench::mark(A %*% B) %>%
  print(width = Inf)
""";
┌ Warning: RCall.jl: 
│ Attaching package: 'dplyr'
│ 
│ The following objects are masked from 'package:stats':
│ 
│     filter, lag
│ 
│ The following objects are masked from 'package:base':
│ 
│     intersect, setdiff, setequal, union
│ 
└ @ RCall /Users/huazhou/.julia/packages/RCall/0ggIQ/src/io.jl:172
# A tibble: 1 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 A %*% B       127ms    128ms      7.49    30.5MB     7.49     4     4
  total_time result                memory             time          
    <bch:tm> <list>                <list>             <list>        
1      534ms <dbl [2,000 x 2,000]> <Rprofmem [1 x 3]> <bench_tm [4]>
  gc              
  <list>          
1 <tibble [4 x 3]>
┌ Warning: RCall.jl: Warning: Some expressions had a GC in every iteration; so filtering is disabled.
└ @ RCall /Users/huazhou/.julia/packages/RCall/0ggIQ/src/io.jl:172
  • Re-building R from source with OpenBLAS or MKL will immediately boost linear algebra performance in R. Google “build R using MKL” to get started. Similarly, we can use MKL in Julia, as sketched below.
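
A minimal sketch of the Julia side (assuming the MKL.jl package is installed):

using MKL           # swaps the default OpenBLAS for MKL via libblastrampoline
using LinearAlgebra

BLAS.get_config()   # should now report MKL as the active BLAS backend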

  • Matlab uses MKL. Usually it’s very hard to beat Matlab in terms of linear algebra.

using MATLAB

mat"""
f = @() $A * $B;
timeit(f)
"""
ArgumentError: Package MATLAB not found in current path.
- Run `import Pkg; Pkg.add("MATLAB")` to install the MATLAB package.

Stacktrace:
  [1] macro expansion
      @ ./loading.jl:2296 [inlined]
  [2] macro expansion
      @ ./lock.jl:273 [inlined]
  [3] __require(into::Module, mod::Symbol)
      @ Base ./loading.jl:2271
  [4] #invoke_in_world#3
      @ ./essentials.jl:1089 [inlined]
  [5] invoke_in_world
      @ ./essentials.jl:1086 [inlined]
  [6] require(into::Module, mod::Symbol)
      @ Base ./loading.jl:2260
  [7] eval
      @ ./boot.jl:430 [inlined]
  [8] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
      @ Base ./loading.jl:2734
  [9] #invokelatest#2
      @ ./essentials.jl:1055 [inlined]
 [10] invokelatest
      @ ./essentials.jl:1052 [inlined]
 [11] (::VSCodeServer.var"#217#218"{VSCodeServer.NotebookRunCellArguments, String})()
      @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.127.2/scripts/packages/VSCodeServer/src/serve_notebook.jl:24
 [12] withpath(f::VSCodeServer.var"#217#218"{VSCodeServer.NotebookRunCellArguments, String}, path::String)
      @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.127.2/scripts/packages/VSCodeServer/src/repl.jl:276
 [13] notebook_runcell_request(conn::VSCodeServer.JSONRPC.JSONRPCEndpoint{Base.PipeEndpoint, Base.PipeEndpoint}, params::VSCodeServer.NotebookRunCellArguments)
      @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.127.2/scripts/packages/VSCodeServer/src/serve_notebook.jl:13
 [14] dispatch_msg(x::VSCodeServer.JSONRPC.JSONRPCEndpoint{Base.PipeEndpoint, Base.PipeEndpoint}, dispatcher::VSCodeServer.JSONRPC.MsgDispatcher, msg::Dict{String, Any})
      @ VSCodeServer.JSONRPC ~/.vscode/extensions/julialang.language-julia-1.127.2/scripts/packages/JSONRPC/src/typed.jl:67
 [15] serve_notebook(pipename::String, debugger_pipename::String, outputchannel_logger::Base.CoreLogging.SimpleLogger; error_handler::var"#5#10"{String})
      @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.127.2/scripts/packages/VSCodeServer/src/serve_notebook.jl:147
 [16] top-level scope
      @ ~/.vscode/extensions/julialang.language-julia-1.127.2/scripts/notebook/notebook.jl:35

7 Avoid memory allocation: some examples

7.1 Transposing a matrix is an expensive memory operation

In R, the command

t(A) %*% x

will first transpose A and then perform the matrix multiplication, causing unnecessary memory allocation.

using Random, LinearAlgebra, BenchmarkTools
Random.seed!(123)

n = 1000
A = rand(n, n)
x = rand(n);
R"""
A <- $A
x <- $x
bench::mark(t(A) %*% x) %>%
  print(width = Inf)
""";
# A tibble: 1 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 t(A) %*% x   2.05ms   2.37ms      415.    7.64MB     85.1   151    31
  total_time result            memory             time            
    <bch:tm> <list>            <list>             <list>          
1      364ms <dbl [1,000 x 1]> <Rprofmem [3 x 3]> <bench_tm [182]>
  gc                
  <list>            
1 <tibble [182 x 3]>
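
As an aside (not benchmarked here), R's crossprod(A, x) computes t(A) %*% x without explicitly forming the transpose and is the preferred idiom in R:

R"""
crossprod(A, x)
""";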

Julia avoids transposing the matrix whenever possible. The Transpose type merely creates a view of the transposed matrix data.

typeof(transpose(A))
Transpose{Float64, Matrix{Float64}}
fieldnames(typeof(transpose(A)))
(:parent,)
# same data in transpose(A) and original matrix A
pointer(transpose(A).parent), pointer(A)
(Ptr{Float64} @0x00000001190c8000, Ptr{Float64} @0x00000001190c8000)
# dispatch to BLAS
# does *not* actually transpose the matrix
@benchmark transpose($A) * $x
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  22.750 μs 2.966 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     23.167 μs               GC (median):    0.00%
 Time  (mean ± σ):   25.605 μs ± 43.023 μs   GC (mean ± σ):  0.00% ± 0.00%
  █▆▄▁ ▁▁▁▁                                                  ▁
  █████████▇▇▆▇▆▅▆▆▅▅▄▅▄▅▃▁▃▄▃▄▅▃▁▃▃▃▄▃▄▃▃▁▁▄▃▁▅▄▅▆▇▇▇▇▇▅▆▆ █
  22.8 μs      Histogram: log(frequency) by time      54.2 μs <
 Memory estimate: 8.06 KiB, allocs estimate: 3.
# pre-allocate result
out = zeros(size(A, 2))
@benchmark mul!($out, transpose($A), $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  22.584 μs432.959 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     22.791 μs                GC (median):    0.00%
 Time  (mean ± σ):   24.214 μs ±   9.847 μs   GC (mean ± σ):  0.00% ± 0.00%
      ▁▂▁                                                   ▁
  █▇▇█████▇▇▅▇█▇▅▆▅▄▄▄▄▄▄▃▃▃▄▃▄▄▁▄▁▃▃▃▁▄▁▃▁▁▁▄▄▁▄▅▆█▇▆▆▅▄▄▄▄ █
  22.6 μs       Histogram: log(frequency) by time      54.5 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.
# or call BLAS wrapper directly
@benchmark LinearAlgebra.BLAS.gemv!('T', 1.0, $A, $x, 0.0, $out)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  22.750 μs186.541 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     22.917 μs                GC (median):    0.00%
 Time  (mean ± σ):   23.785 μs ±   5.861 μs   GC (mean ± σ):  0.00% ± 0.00%
      ▁▁                                                    ▁
  ▇▆▆████▇█▇▆▇▇▇▅▆▆▅▄▄▄▅▃▄▃▃▁▄▃▁▃▄▁▁▃▁▄▁▁▃▃▁▃▃▁▁▁▁▁▁▃▁▁▁▅▆▆ █
  22.8 μs       Histogram: log(frequency) by time      49.8 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.

7.2 Broadcast (dot operation) fuses loops and avoids memory allocation

Broadcasting in Julia achieves vectorized code without creating intermediate arrays.

Suppose we want to calculate the elementwise maximum of the absolute values of two large arrays. In R or Matlab, the command

max(abs(X), abs(Y))

will create two intermediate arrays and then one result array.

using RCall

Random.seed!(123)
X, Y = rand(1000, 1000), rand(1000, 1000)

R"""
library(dplyr)
library(bench)
bench::mark(max(abs($X), abs($Y))) %>%
  print(width = Inf)
""";
# A tibble: 1 x 13
  expression                           min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 max(abs(`#JL`$X), abs(`#JL`$Y))   3.45ms   3.72ms      268.    15.3MB     278.
  n_itr  n_gc total_time result    memory             time            
  <int> <dbl>   <bch:tm> <list>    <list>             <list>          
1    57    59      213ms <dbl [1]> <Rprofmem [2 x 3]> <bench_tm [116]>
  gc                
  <list>            
1 <tibble [116 x 3]>

In Julia, dot operations are fused so no intermediate arrays are created.

# no intermediate arrays created, only result array created
@benchmark max.(abs.($X), abs.($Y))
BenchmarkTools.Trial: 5379 samples with 1 evaluation per sample.
 Range (minmax):  268.291 μs  3.785 ms   GC (min … max):  0.00% … 79.01%
 Time  (median):     717.291 μs                GC (median):     0.00%
 Time  (mean ± σ):   927.827 μs ± 429.730 μs   GC (mean ± σ):  23.01% ± 23.15%
            ▅▇█▆▅▄▃▁            ▂▃▃▃▃▂▂▂▂▂▁▁ ▁                 ▂
  ▆▄▁▁▁▁▁▁▁██████████▅▃▁▁▅▁▁▄▇█████████████████████▆▇▇▇█▇▇▇▇▇ █
  268 μs        Histogram: log(frequency) by time       2.32 ms <
 Memory estimate: 7.66 MiB, allocs estimate: 3.

Pre-allocating the result array gets rid of memory allocation altogether.

# no memory allocation at all!
Z = zeros(size(X)) # zero matrix of same size as X
@benchmark $Z .= max.(abs.($X), abs.($Y)) # .= (vs =) is important!
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  237.083 μs403.917 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     245.980 μs                GC (median):    0.00%
 Time  (mean ± σ):   250.909 μs ±  16.632 μs   GC (mean ± σ):  0.00% ± 0.00%
  ▂▇█▆▇▇▇▇▆▄▄▃▃▃▂▂▂▂▁▁▁▁▁▁ ▁ ▁ ▁   ▁    ▁                     ▃
  ████████████████████████████████████▇██████▇▆█▇▇▇▇▇▇▇▅▆▅▆▄▅ █
  237 μs        Histogram: log(frequency) by time        320 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.

Exercise: Annotate the broadcasting by @turbo and @tturbo (multi-threading) and benchmark again.

7.3 Views

A view avoids creating an extra copy of the matrix data.

Random.seed!(123)
A = randn(1000, 1000)

# sum entries in a sub-matrix
@benchmark sum($A[1:2:500, 1:2:500])
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  32.000 μs  8.062 ms   GC (min … max):  0.00% … 98.95%
 Time  (median):     57.333 μs                GC (median):     0.00%
 Time  (mean ± σ):   81.368 μs ± 250.642 μs   GC (mean ± σ):  28.88% ± 11.31%
     ▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▂▁▁▂▁▂▂▁▂▂▂▂▂▁▁▁▂▂▁▂▁▂▂▁▂▂▂▂▂▂▂▂▂ ▂
  32 μs           Histogram: frequency by time         1.08 ms <
 Memory estimate: 512.08 KiB, allocs estimate: 3.
# view avoids creating a separate sub-matrix
@benchmark sum(@view $A[1:2:500, 1:2:500])
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  63.750 μs109.834 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     64.000 μs                GC (median):    0.00%
 Time  (mean ± σ):   64.855 μs ±   2.752 μs   GC (mean ± σ):  0.00% ± 0.00%
  █▄▁          ▁▃▁                                            ▁
  ████▇▅▅▅▄▄▂▄██████▇▇▆▆▄▅▅▆▅▆▆▆▇▇▆▇▆▆▆▆▅▅▆▆▆▅▆▄▆▅▆▅▄▅▃▄▃▃▂▄ █
  63.8 μs       Histogram: log(frequency) by time      78.2 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.

The @views macro converts every array-slicing operation in an expression into a view, which can be useful in some operations.

@benchmark @views sum($A[1:2:500, 1:2:500])
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  63.709 μs123.250 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     64.000 μs                GC (median):    0.00%
 Time  (mean ± σ):   65.242 μs ±   3.492 μs   GC (mean ± σ):  0.00% ± 0.00%
  █▂▂▃▁    ▁▂▁▁▁                                             ▁
  ██████▇▆████████▇▇▇▇▇▆▇▇▇██▇▇▇▆▆▆▆▆▅▆▆▆▆▆▆▃▆▅▅▄▅▅▅▄▆▄▅▅▅▅▄ █
  63.7 μs       Histogram: log(frequency) by time      81.3 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.
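
For instance, a minimal sketch (the function normalize_cols! is hypothetical): annotating a whole function with @views turns every slice inside it into a view, so the column normalization below allocates nothing.

using LinearAlgebra

@views function normalize_cols!(A)
    for j in axes(A, 2)
        # A[:, j] is a view, not a copy, thanks to @views
        A[:, j] ./= norm(A[:, j])
    end
    A
end

normalize_cols!(randn(5, 3))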